First, we load the dataset and the required libraries.
# Load libraries
library(tidyverse)
library(ggplot2)
library(ggpubr)
library(car)
library(plotly)
# Read the data (as_tibble replaces the deprecated as.tibble)
df <- as_tibble(read.csv("gapminder_clean.csv", row.names = 1))
# Keep only the 1962 time point
df_1962 <- df %>%
  dplyr::filter(Year == 1962)
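The exploratory scatter plot referred to below is missing from this render; a minimal sketch of it follows. Since the real `gapminder_clean.csv` is not reproduced here, a small toy tibble stands in for `df_1962` (the column values are illustrative only):

```r
library(ggplot2)
library(tibble)

# Toy stand-in for df_1962 (illustrative values, not the real data)
toy_1962 <- tibble(
  gdpPercap = c(500, 1500, 4000, 9000, 15000),
  co2       = c(0.3, 1.2, 3.5, 8.0, 14.0)
)

# Scatter plot on a log10 x-axis, as is common for gapminder-style data
p <- ggplot(toy_1962, aes(x = gdpPercap, y = co2)) +
  geom_point() +
  scale_x_log10() +
  labs(x = "GDP per capita",
       y = "CO2 emissions (metric tons per capita)")
```

With the real data, `df_1962` and the original column names would be used in place of the toy tibble.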
The plot above suggests a correlation between the selected variables, so we compute Pearson's r for them:
cor.test(df_1962$gdpPercap, df_1962$CO2.emissions..metric.tons.per.capita., method = "pearson")
##
## Pearson's product-moment correlation
##
## data: df_1962$gdpPercap and df_1962$CO2.emissions..metric.tons.per.capita.
## t = 25.269, df = 106, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8934697 0.9489792
## sample estimates:
## cor
## 0.9260817
The test indicates a strong positive correlation between gdpPercap and CO2 emissions, with r = 0.926 and p-value &lt; 2.2e-16.
# Get all the years (time points) from the dataframe
years <- unique(df$Year)
# Compute the Pearson correlation for every year
corrs <- sapply(years, function(x){
  filt_df <- df %>%
    filter(Year == x) %>%
    transmute(percap = gdpPercap, co2 = CO2.emissions..metric.tons.per.capita.) %>%
    drop_na()
  as.numeric(cor.test(filt_df$percap, filt_df$co2, method = "pearson")$estimate)
})
# Find the year with the maximum Pearson's r
max_year <- years[which.max(corrs)]
We now have the year with the strongest correlation stored in the variable max_year (= 1967). Next, we construct an interactive plot, using this year as a filter.
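The interactive plot itself is missing from this render; a minimal plotly sketch follows. A toy tibble stands in for the 1967 slice of the data (the values are illustrative only):

```r
library(plotly)
library(tibble)

# Toy stand-in for the data filtered to max_year (1967)
toy_1967 <- tibble(
  country   = c("A", "B", "C"),
  gdpPercap = c(800, 5000, 20000),
  co2       = c(0.5, 4.2, 16.1)
)

# Interactive scatter plot; hovering over a point shows the country name
fig <- plot_ly(toy_1967, x = ~gdpPercap, y = ~co2,
               type = "scatter", mode = "markers", text = ~country)
```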
## Continent and Energy use variables
First things first: we should plot the data to get to know it.
The plot suggests that we have one categorical variable and one quantitative variable. The best statistical test for this kind of problem is ANOVA. However, there are several assumptions we need to verify before applying the test:
1. The observations were obtained independently and randomly
2. The data at each factor level is normally distributed
3. The populations have a common variance
The first assumption holds by design: we take the observations to have been collected independently and randomly. For the rest, we should run some tests.
Let's filter the data for the tests:
df_energy <- df %>%
transmute(continent=continent, energy=Energy.use..kg.of.oil.equivalent.per.capita.) %>%
drop_na() %>%
filter(continent != "")
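The exploratory plot mentioned above is missing from this render; a minimal boxplot sketch follows, with a toy tibble standing in for `df_energy` (the values are illustrative only):

```r
library(ggplot2)
library(tibble)

# Toy stand-in for df_energy (illustrative values)
toy_energy <- tibble(
  continent = rep(c("Africa", "Europe"), each = 4),
  energy    = c(300, 450, 700, 900, 2500, 3000, 3200, 4100)
)

# One box per continent makes the spread of the groups easy to compare
p <- ggplot(toy_energy, aes(x = continent, y = energy)) +
  geom_boxplot() +
  labs(y = "Energy use (kg of oil equivalent per capita)")
```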
To test the common-variance assumption, we use Levene's test from the car package, because it is less sensitive to departures from normality than the analogous Bartlett's test:
leveneTest(energy ~ continent, data=df_energy)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 12.424 8.003e-10 ***
## 843
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The null hypothesis of Levene's test is that the groups have a common variance. We reject it with a p-value of 8e-10, so we cannot assume that the variance is common between the populations.
We should also check whether the data is normally distributed within each continent. For that we use the Shapiro-Wilk test.
# Run a Shapiro-Wilk test for each continent
continents <- unique(df_energy$continent)
res <- lapply(continents, function(x){
  df_tmp <- df_energy %>%
    filter(continent==x)
  list(shapiro.test(df_tmp$energy), x)
})
for (i in res){
  print(paste(i[[2]], "have W: ", i[[1]]$statistic, " with p-value ", i[[1]]$p.value))
}
## [1] "Europe have W: 0.896467349504789 with p-value 2.99948739634846e-12"
## [1] "Africa have W: 0.674723638575491 with p-value 2.3343308284197e-19"
## [1] "Americas have W: 0.563222686679644 with p-value 1.5868536782942e-21"
## [1] "Oceania have W: 0.981809862485087 with p-value 0.955266147969386"
## [1] "Asia have W: 0.660991122763317 with p-value 4.94311060721094e-19"
The output above suggests that the normality assumption is violated. However, the Shapiro-Wilk test tends to flag even the slightest deviation from normality when the sample is large (>50 observations). Therefore, it is always better to cross-check with QQ-plots.
The QQ-plots align with the results of the Shapiro-Wilk test: only the Oceania data looks normally distributed. Just to be sure, let's print the summary statistics for each continent.
group_by(df_energy, continent) %>%
summarise(
count = n(),
mean = mean(energy, na.rm = TRUE),
median=median(energy, na.rm = TRUE),
sd = sd(energy, na.rm = TRUE)
)
## # A tibble: 5 x 5
## continent count mean median sd
## * <chr> <int> <dbl> <dbl> <dbl>
## 1 Africa 199 699. 450. 627.
## 2 Americas 188 1704. 749. 2377.
## 3 Asia 185 1867. 760. 2590.
## 4 Europe 256 3146. 3028. 1734.
## 5 Oceania 20 3980. 4045. 1123.
The small sample size partly explains why Oceania passes the normality check. Given that energy use grows over time, a non-normal distribution across the 10 time points is to be expected.
So we should use non-parametric tests instead. The non-parametric analogue of ANOVA is the Kruskal-Wallis test:
kruskal.test(energy ~ continent, data = df_energy)
##
## Kruskal-Wallis rank sum test
##
## data: energy by continent
## Kruskal-Wallis chi-squared = 318.68, df = 4, p-value < 2.2e-16
The Kruskal-Wallis test tells us that there is a statistically significant difference in medians among the populations. However, we haven't compared them pairwise. Because the normality assumption is violated for 4 out of 5 groups, we use pairwise comparisons with the Wilcoxon rank sum test:
pairwise.wilcox.test(df_energy$energy, df_energy$continent,
p.adjust.method = "BH")
##
## Pairwise comparisons using Wilcoxon rank sum test with continuity correction
##
## data: df_energy$energy and df_energy$continent
##
## Africa Americas Asia Europe
## Americas 1.9e-14 - - -
## Asia 0.00016 0.09746 - -
## Europe < 2e-16 < 2e-16 < 2e-16 -
## Oceania 1.1e-12 9.4e-08 3.5e-07 0.00424
##
## P value adjustment method: BH
The results above suggest significant differences in energy use between almost all pairs of the 5 continents across the 10 time points. The only non-significant difference (p-value = 0.09746) is between Asia and the Americas.
Finally, let's draw the plot with all the statistics included.
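The final plot is missing from this render; a minimal sketch using ggpubr follows, annotating the boxplot with the global Kruskal-Wallis p-value. A toy tibble stands in for `df_energy` (the values are illustrative only):

```r
library(ggpubr)
library(tibble)

# Toy stand-in for df_energy (illustrative values)
toy_energy <- tibble(
  continent = rep(c("Africa", "Asia", "Europe"), each = 4),
  energy    = c(300, 450, 700, 900, 600, 900, 2200, 3000,
                2500, 3000, 3200, 4100)
)

# Boxplot with the Kruskal-Wallis p-value printed on the panel
p <- ggboxplot(toy_energy, x = "continent", y = "energy") +
  stat_compare_means(method = "kruskal.test")
```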
## Imports of goods and services analysis
The point of this analysis is to check whether there is a significant difference in imports between Europe and Asia in the years after 1990. Let's filter the data first:
df_1990 <- df %>%
filter(Year>1990) %>%
transmute(continent=continent,import=Imports.of.goods.and.services....of.GDP. ) %>%
filter(continent=="Asia" | continent=="Europe") %>%
drop_na()
Let's create a simple boxplot to get a sense of the data.
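The boxplot is missing from this render; a minimal sketch follows, with a toy tibble standing in for `df_1990` (the values are illustrative only):

```r
library(ggplot2)
library(tibble)

# Toy stand-in for df_1990 (illustrative values)
toy_import <- tibble(
  continent = rep(c("Asia", "Europe"), each = 4),
  import    = c(20, 35, 50, 120, 25, 38, 45, 60)
)

# Side-by-side boxes show the location and spread of the two groups
p <- ggplot(toy_import, aes(x = continent, y = import)) +
  geom_boxplot() +
  labs(y = "Imports of goods and services (% of GDP)")
```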
We have two groups and want to check the difference between them. A one-sample t-test is not applicable, because the groups are independent and there is no reference value to compare against. The unpaired two-sample t-test is therefore the better option, but first we need to meet the assumptions of this parametric test:
1. Normal distribution within each group
2. Equal variance between the groups
Let's perform the normality check using the Shapiro-Wilk test and a QQ-plot.
continents <- unique(df_1990$continent)
res <- lapply(continents, function(x){
df_tmp <- df_1990 %>%
filter(continent==x)
list(shapiro.test(df_tmp$import), x)
})
for (i in res){
print(paste(i[[2]], "have W: ", i[[1]]$statistic, " with p-value ", i[[1]]$p.value))
}
## [1] "Asia have W: 0.854896735520448 with p-value 2.30969161420933e-08"
## [1] "Europe have W: 0.929048540991224 with p-value 1.36333544206785e-05"
And let’s explore the basic stats for each group
group_by(df_1990, continent) %>%
summarise(
count = n(),
mean = mean(import, na.rm = TRUE),
median=median(import, na.rm = TRUE),
sd = sd(import, na.rm = TRUE)
)
## # A tibble: 2 x 5
## continent count mean median sd
## * <chr> <int> <dbl> <dbl> <dbl>
## 1 Asia 98 46.8 39.8 33.5
## 2 Europe 114 41.8 37.8 16.7
We can see that the normality assumption is violated: the Shapiro-Wilk test gives p &lt; 0.05 for each group. Given the sensitivity of this test, we also plotted QQ-plots, which unfortunately confirm the violation.
Next, let's check whether the variances are equal between the groups:
var.test(import ~ continent, data = df_1990)
##
## F test to compare two variances
##
## data: import by continent
## F = 4.0189, num df = 97, denom df = 113, p-value = 3.696e-12
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 2.740396 5.927773
## sample estimates:
## ratio of variances
## 4.018893
As we can see, the homogeneity-of-variances assumption is also violated (p-value &lt;&lt; 0.05).
We should therefore use the non-parametric alternative: the unpaired two-sample Wilcoxon test.
wilcox.test(import~continent,data=df_1990, alternative = "two.sided")
##
## Wilcoxon rank sum test with continuity correction
##
## data: import by continent
## W = 5707, p-value = 0.7867
## alternative hypothesis: true location shift is not equal to 0
The difference between the two groups is not statistically significant (p-value = 0.7867).
And the final visualization:
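The final visualization is missing from this render; a minimal ggpubr sketch follows, annotating the boxplot with the Wilcoxon p-value. A toy tibble stands in for `df_1990` (the values are illustrative only):

```r
library(ggpubr)
library(tibble)

# Toy stand-in for df_1990 (illustrative values)
toy_import <- tibble(
  continent = rep(c("Asia", "Europe"), each = 4),
  import    = c(20, 35, 50, 120, 25, 38, 45, 60)
)

# Boxplot annotated with the Wilcoxon rank sum test p-value
p <- ggboxplot(toy_import, x = "continent", y = "import") +
  stat_compare_means(method = "wilcox.test")
```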
Let's transform the initial data, keeping only population density, country, and year.
df_pop <- df%>%
transmute(year=Year, country=Country.Name, pop_den=Population.density..people.per.sq..km.of.land.area.) %>%
drop_na()
Let's take a look at the data with an interactive plot.
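The interactive plot is missing from this render; a minimal plotly sketch follows, with a toy tibble standing in for `df_pop` (the values are illustrative only):

```r
library(plotly)
library(tibble)

# Toy stand-in for df_pop (illustrative values)
toy_pop <- tibble(
  year    = rep(c(1962, 1967), each = 2),
  country = rep(c("Monaco", "Macao SAR, China"), 2),
  pop_den = c(15000, 16000, 15500, 17000)
)

# Interactive line plot of population density over time, one trace per country
fig <- plot_ly(toy_pop, x = ~year, y = ~pop_den, color = ~country,
               type = "scatter", mode = "lines+markers")
```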
We can tell right away that the highest population density in every year belongs to either Macao SAR, China or Monaco. However, we cannot tell which one wins across all years, so let's rank the countries.
# Get unique counties and years to a vectors
countries <- unique(df_pop$country)
years <- unique(df_pop$year)
# Make an empty dataframe with countries, which we will populate
country_rank <- as.data.frame(unique(df_pop$country))
colnames(country_rank) <- "country"
# Iterate over years, rank the population density, add ranks to the dataframe
for (i in years){
  df_tmp <- df_pop %>%
    filter(year == i)
  data_ranked <- as.data.frame(cbind(df_tmp$country, as.numeric(rank(df_tmp$pop_den))))
  country_rank <- cbind(country_rank, as.numeric(data_ranked[match(country_rank$country, data_ranked$V1), ]$V2))
}
# Rename columns
colnames(country_rank) <- c("country", years)
# Build a numeric matrix from the dataframe (to compute the mean rank)
rank_matrix <- country_rank %>%
  select(-country) %>%
  as.matrix()
country_rank <- cbind(country_rank, rowMeans(rank_matrix))
country_rank[is.na(country_rank)] = 0
colnames(country_rank) <- c("country", years, "average")
# Print the rows with the highest average rank (the bigger, the better)
country_rank %>%
  filter(average == max(average))
## country 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007 average
## 1 Macao SAR, China 252 252 253 252 252 253 257 259 262 261 255.3
## 2 Monaco 253 253 252 253 253 252 256 258 261 262 255.3
We conclude that the countries with the highest population density across all the years are Macao SAR, China and Monaco.
First, let's filter the data, selecting year, country, and life expectancy:
df_life <- df %>%
transmute(year=Year, country=Country.Name, life=Life.expectancy.at.birth..total..years.) %>%
drop_na()
Look at the data.
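The plot is missing from this render; a minimal sketch follows, drawing one life-expectancy line per country. A toy tibble stands in for `df_life` (the values are illustrative only):

```r
library(ggplot2)
library(tibble)

# Toy stand-in for df_life (illustrative values)
toy_life <- tibble(
  year    = rep(c(1962, 2007), 2),
  country = rep(c("Bhutan", "China"), each = 2),
  life    = c(33.1, 66.3, 44.4, 74.3)
)

# One line per country; with ~200 countries this quickly becomes unreadable
p <- ggplot(toy_life, aes(x = year, y = life, group = country)) +
  geom_line()
```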
The plot is less than informative. We should pinpoint the greatest increase in life expectancy since 1962, i.e. the greatest difference between 2007 and 1962. Let's compute it:
df_life <- df_life %>%
pivot_wider(names_from = year, values_from=life)
matrx <- df_life %>%
select(-country)
res <- apply(matrx, 1, function(x){
  # Difference between the last and the first non-NA year for each country
  lst <- x[!is.na(x)]
  lst[length(lst)] - lst[1]
})
df_life$diff <- res
top_diffs <- filter(df_life, diff %in% sort(df_life$diff, decreasing = T)[1:10])[c(1,2,11,12)]
top_diffs
## # A tibble: 10 x 4
## country `1962` `2007` diff
## <chr> <dbl> <dbl> <dbl>
## 1 Bhutan 33.1 66.3 33.2
## 2 China 44.4 74.3 29.9
## 3 Iran, Islamic Rep. 46.1 72.7 26.6
## 4 Maldives 38.5 75.4 36.9
## 5 Nepal 36.0 66.6 30.6
## 6 Oman 44.3 75.1 30.8
## 7 Saudi Arabia 46.7 73.3 26.7
## 8 Timor-Leste 34.7 65.8 31.1
## 9 Tunisia 43.3 74.2 30.9
## 10 Yemen, Rep. 34.7 62.0 27.2
Let's make the plot one more time, but this time color the top countries.
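The highlighted plot is missing from this render; a minimal sketch of the idea follows: draw the bulk of the countries in grey and overlay the top gainers in color. A toy tibble stands in for `df_life` joined with the top-10 flag (the values are illustrative only):

```r
library(ggplot2)
library(tibble)

# Toy stand-in for df_life with a flag marking the top gainers
toy_life <- tibble(
  year    = rep(c(1962, 2007), 3),
  country = rep(c("Bhutan", "China", "Other"), each = 2),
  life    = c(33.1, 66.3, 44.4, 74.3, 50.0, 60.0),
  top     = rep(c(TRUE, TRUE, FALSE), each = 2)
)

# Grey background lines for all countries, colored lines for the top gainers
p <- ggplot(toy_life, aes(x = year, y = life, group = country)) +
  geom_line(color = "grey80") +
  geom_line(data = subset(toy_life, top), aes(color = country))
```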